Abstract

In this work, we address the problem of modeling all the nodes (words or
phrases) in a dependency tree with dense representations. We propose a
recursive convolutional neural network (RCNN) architecture to capture
syntactic and compositional-semantic representations of phrases and words in
a dependency tree. Unlike the original recursive neural network, we introduce
convolution and pooling layers, which can model a variety of compositions
through the feature maps and choose the most informative compositions through
the pooling layers. Based on RCNN, we use a discriminative model to re-rank a
k-best list of candidate dependency parsing trees. The experiments show that
RCNN is very effective at improving state-of-the-art dependency parsing on
both English and Chinese datasets.


1 Introduction 


Feature-based discriminative supervised models have achieved much progress in
dependency parsing (Nivre, 2004; Yamada and Matsumoto, 2003; McDonald et al.,
2005). These models typically use millions of discrete binary features
generated from a limited amount of training data. However, the ability of
these models is restricted by the design of features. The number of features
can be so large that the resulting models are too complicated for practical
use and prone to overfitting on the training corpus due to data sparseness.

Recently, many methods have been proposed to learn various distributed
representations at both the syntactic and semantic levels. These distributed
representations have been extensively applied to many natural language
processing (NLP) tasks, such as syntax (Turian et al., 2010; Mikolov et al.,
2010; Collobert et al., 2011; Chen and Manning, 2014) and semantics (Huang et
al., 2012; Mikolov et al., 2013). Distributed representations represent words
(or phrases) by dense, low-dimensional and real-valued vectors, which help
address the curse of dimensionality and generalize better than discrete
representations.

Figure 1: Illustration of an RCNN unit.

For dependency parsing, Chen et al. (2014) and Bansal et al. (2014) used
dense vectors (embeddings) to represent words or features and found these
representations to be complementary to the traditional discrete feature
representation. However, these two methods focus only on the dense
representations (embeddings) of words or features. These embeddings are
pre-trained and kept unchanged in the training phase of the parsing model, so
they cannot be optimized for the specific task.

Besides, it is also important to represent (unseen) phrases with dense
vectors in dependency parsing. Since a dependency tree also has a recursive
structure, it is intuitive to use the recursive neural network (RNN), which
has been used for constituent parsing (Socher et al., 2013a). However, the
recursive neural network can only process binary combinations and is not
suitable for dependency parsing, since a parent node may have two or more
child nodes in a dependency tree.

In this work, we address the problem of representing nodes at all levels
(words or phrases) with dense representations in a dependency tree. We
propose a recursive convolutional neural network (RCNN) architecture to
capture syntactic and compositional-semantic representations of phrases and
words. RCNN is a general architecture and can deal with k-ary parsing trees;
therefore it is very suitable for dependency parsing. For each node in a
given dependency tree, we first use an RCNN unit to model the interactions
between it and each of its children and choose the most informative features
by a pooling layer. We can then apply the RCNN unit recursively to get the
vector representation of the whole dependency tree. The output of each RCNN
unit is used as the input of the RCNN unit of its parent node, until a single
fixed-length vector is output at the root node. Figure 1 illustrates an
example of how an RCNN unit represents the phrase “a red bike” as continuous
vectors.

The contributions of this paper can be summarized as follows.

• RCNN is a general architecture for modeling the distributed
  representations of a phrase or sentence with its dependency tree. Although
  RCNN is used only for re-ranking the output of a dependency parser in this
  paper, it can be regarded as a semantic model of text sequences that maps
  input sequences of varying length to a fixed-length vector. The parameters
  of RCNN can be learned jointly with other NLP tasks, such as text
  classification.

• Each RCNN unit can model the complicated interactions between the head word
  and its children. Combined with a specific task, RCNN can capture the most
  useful semantic and structural information through the convolution and
  pooling layers.

• When applied to the re-ranking model for parsing, RCNN improves the
  accuracy of the base parser. The experiments on two benchmark datasets show
  that RCNN outperforms the state-of-the-art models.


2 Recursive Neural Network

In this section, we briefly describe the recursive neural network
architecture of Socher et al. (2013a).

Figure 2: Illustration of an RNN unit.

The idea of recursive neural networks (RNN) for natural language processing
(NLP) is to train a deep learning model that can be applied to phrases and
sentences, which have a grammatical structure (Pollack, 1990; Socher et al.,
2013c). RNN can also be regarded as a general structure for modeling
sentences. At every node in the tree, the contexts at the left and right
children of the node are combined by a classical layer. The weights of the
layer are shared across all nodes in the tree. The layer computed at the top
node gives a representation for the whole sentence.

Following the binary tree structure, RNN assigns a fixed-length vector to
each word at the leaves of the tree, and combines word and phrase pairs
recursively to create intermediate node vectors of the same length,
eventually producing one final vector that represents the whole sentence.
Multiple recursive combination functions have been explored, from linear
transformation matrices to tensor products (Socher et al., 2013c). Figure 2
illustrates the architecture of RNN.

The binary tree can be represented in the form of branching triplets
$(p \rightarrow c_1 c_2)$. Each such triplet denotes that a parent node $p$
has two children, and each $c_k$ can be either a word or a non-terminal node
in the tree.

Given a labeled binary parse tree, $((p_2 \rightarrow a\, p_1), (p_1
\rightarrow b\, c))$, the node representations are computed by

$$\mathbf{p}_1 = f\left(\mathbf{W} \begin{bmatrix} \mathbf{b} \\ \mathbf{c} \end{bmatrix}\right), \qquad \mathbf{p}_2 = f\left(\mathbf{W} \begin{bmatrix} \mathbf{a} \\ \mathbf{p}_1 \end{bmatrix}\right), \tag{1}$$

where $(\mathbf{p}_1, \mathbf{p}_2, \mathbf{a}, \mathbf{b}, \mathbf{c})$
are the vector representations of $(p_1, p_2, a, b, c)$ respectively, which
are denoted by lowercase bold font letters; $\mathbf{W}$ is a matrix of
parameters of the RNN, and $f$ is an element-wise non-linear activation
function such as tanh.

Based on RNN, Socher et al. (2013a) introduced a compositional vector
grammar, which uses syntactically untied weights $\mathbf{W}$ to learn
syntactic-semantic, compositional vector representations. To compute a score
of how plausible a parent is as a syntactic constituent, RNN uses a
single-unit linear layer for all $\mathbf{p}_i$:

$$s(\mathbf{p}_i) = \mathbf{v} \cdot \mathbf{p}_i, \tag{2}$$

where $\mathbf{v}$ is a vector of parameters that need to be trained. This
score is used to find the highest-scoring tree. For more details on how the
standard RNN can be used for parsing, see Socher et al. (2011).
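As an illustration, the composition and scoring steps of the binary RNN can be sketched in plain Python. This is a minimal sketch rather than the implementation of Socher et al.: the helper names (`matvec`, `compose`, `score`), the choice of tanh as the non-linearity, the 2-dimensional vectors and all numeric values are assumptions made for the example.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def compose(W, left, right):
    """Parent vector tanh(W [left; right]) of a binary node, as in Eq. (1)."""
    return [math.tanh(u) for u in matvec(W, left + right)]


def score(v, p):
    """Plausibility score s(p) = v . p of a parent vector, as in Eq. (2)."""
    return sum(vi * pi for vi, pi in zip(v, p))


# Toy 2-dimensional leaf vectors for the tree (p2 -> a p1), (p1 -> b c).
a, b, c = [0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]
W = [[0.5, 0.0, -0.5, 0.0],   # shared 2 x 4 composition matrix (made-up values)
     [0.0, 0.5, 0.0, -0.5]]
v = [1.0, -1.0]               # score vector (made-up values)

p1 = compose(W, b, c)         # phrase covering b and c
p2 = compose(W, a, p1)        # whole tree, reusing the same W
s1, s2 = score(v, p1), score(v, p2)
print(p1, p2, s1, s2)
```

Because the same `W` is reused at every node, the parent vectors keep the dimensionality of the leaves, which is what allows the composition to be applied recursively.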


Costa et al. (2003) applied recursive neural networks to re-rank possible
phrase attachments in an incremental constituency parser. Their work is the
first to show that RNNs can capture enough information to make the correct
parsing decisions. Menchetti et al. (2005) used RNNs to re-rank different
constituency parses. For their results on full sentence parsing, they
re-ranked candidate trees created by the Collins parser (Collins, 2003).


3 Recursive Convolutional Neural Network


Dependency grammar is a widely used syntactic structure, which directly
reflects relationships among the words in a sentence. In a dependency tree,
all nodes are terminal (words) and each node may have more than two children.
Therefore, the standard RNN architecture is not suitable for dependency
grammar, since it is based on the binary tree. In this section, we propose a
more general architecture, called recursive convolutional neural network
(RCNN), which borrows the idea of the convolutional neural network (CNN) and
can deal with k-ary trees.

3.1 RCNN Unit 

For ease of exposition, we first describe the basic unit of RCNN. An RCNN
unit models a head word and its children. Unlike the constituent tree, the
dependency tree does not have non-terminal nodes. Each node consists of a
word and its POS tag. Each node should have a different interaction with its
head node.


Word Embeddings  Given a word dictionary $\mathcal{W}$, each word
$w \in \mathcal{W}$ is represented as a real-valued vector (word embedding)
$\mathbf{w} \in \mathbb{R}^m$, where $m$ is the dimensionality of the vector
space. The word embeddings are then stacked into an embedding matrix
$\mathbf{M} \in \mathbb{R}^{m \times |\mathcal{W}|}$. For a word
$w \in \mathcal{W}$, its corresponding word embedding
$\mathrm{Embed}(w) \in \mathbb{R}^m$ is retrieved by the lookup table layer.
The matrix $\mathbf{M}$ is initialized with pre-trained embeddings and
updated by back-propagation.

Figure 3: Architecture of an RCNN unit.

Distance Embeddings  Besides word embeddings, we also use a distributed
vector to represent the relative distance between a head word $h$ and one of
its children $c$. For example, as shown in Figure 1, the relative distances
of “bike” to “a” and “red” are -2 and -1, respectively. The relative
distances are also mapped to vectors of dimension $m_d$ (a hyperparameter);
these vectors are randomly initialized. Distance embedding is a common way to
encode distance information in neural models and has proven effective in
several tasks. Our experimental results also show that the distance embedding
gives more benefit than the traditional representation. The relative distance
can encode the structural information of a subtree.
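The distance lookup can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names, the clipping of long distances to a maximum value, and the initialization range are hypothetical choices for the example and are not specified in the text.

```python
import random


def relative_distance(head_index, child_index):
    """Signed distance of a child from its head in the sentence."""
    return child_index - head_index


def make_distance_table(max_dist, m_d, seed=0):
    """One randomly initialised m_d-dimensional vector per distance value."""
    rng = random.Random(seed)
    return {d: [rng.uniform(-0.01, 0.01) for _ in range(m_d)]
            for d in range(-max_dist, max_dist + 1)}


def embed_distance(table, dist, max_dist):
    """Clip the distance to the table range (an assumption) and look it up."""
    clipped = max(-max_dist, min(max_dist, dist))
    return table[clipped]


# The phrase from Figure 1, with "bike" as the head word.
words = ["a", "red", "bike"]
head = words.index("bike")
dists = {w: relative_distance(head, i)
         for i, w in enumerate(words) if i != head}

table = make_distance_table(max_dist=5, m_d=25)
d_a = embed_distance(table, dists["a"], max_dist=5)
print(dists, len(d_a))
```

For the phrase “a red bike” this yields the distances -2 and -1 mentioned above, each mapped to its own trainable vector.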

Convolution  The word and distance embeddings are subsequently fed into the
convolution component to model the interactions between two linked nodes.

Unlike the standard RNN, there are no non-terminal nodes in a dependency
tree. Each node $h$ in a dependency tree has two associated distributed
representations:

1. word embedding $\mathbf{w}_h \in \mathbb{R}^m$, which denotes its own
   information according to its word form;

2. phrase representation $\mathbf{x}_h \in \mathbb{R}^m$, which denotes the
   joint representation of the whole subtree rooted at $h$. In particular,
   when $h$ is a leaf node, $\mathbf{x}_h = \mathbf{w}_h$.

Given a subtree rooted at $h$ in a dependency tree, we define $c_i$,
$0 < i \le K$, as the $i$-th child node of $h$, where $K$ represents the
number of children.

For each pair $(h, c_i)$, we use a convolutional hidden layer to compute
their combination representation $\mathbf{z}_i$,

$$\mathbf{z}_i = \tanh\left(\mathbf{W}^{(h,c_i)} \mathbf{p}_i\right), \quad 0 < i \le K, \tag{3}$$

where $\mathbf{W}^{(h,c_i)} \in \mathbb{R}^{m \times n}$ is the linear
composition matrix, which depends on the POS tags of $h$ and $c_i$;
$\mathbf{p}_i \in \mathbb{R}^n$ is the concatenated representation of $h$ and
the $i$-th child, which consists of the head word embedding $\mathbf{w}_h$,
the child phrase representation $\mathbf{x}_{c_i}$ and the distance embedding
$\mathbf{d}^{(h,c_i)}$ of $h$ and $c_i$,

$$\mathbf{p}_i = \mathbf{w}_h \oplus \mathbf{x}_{c_i} \oplus \mathbf{d}^{(h,c_i)}, \tag{4}$$

where $\oplus$ represents the concatenation operation, so that
$n = 2m + m_d$.

The distance $d^{(h,c_i)}$ is the relative distance between $h$ and $c_i$ in
a given sentence. The relative distances are then mapped to
$m_d$-dimensional vectors. Unlike in a constituent tree, the combination
should consider the order or position of each child in a dependency tree.

In our model, we do not use POS tag embeddings directly. Since the
composition matrix varies with the pair of POS tags of $h$ and $c_i$, it can
capture the different syntactic combinations. For example, the combination of
an adjective and a noun should be different from that of a verb and a noun.

After the composition operations, we use tanh as the non-linear activation
function to get a hidden representation $\mathbf{z}$.

Max Pooling  After convolution, we get
$\mathbf{Z}^{(h)} = [\mathbf{z}_1, \mathbf{z}_2, \cdots, \mathbf{z}_K]$,
where $K$ is dynamic and depends on the number of children of $h$. To
transform $\mathbf{Z}$ to a fixed length and determine the most useful
semantic and structural information, we perform a max pooling operation on
the rows of $\mathbf{Z}$,

$$x_j^{(h)} = \max_i \mathbf{Z}_{j,i}^{(h)}, \quad 0 < j \le m. \tag{5}$$

Thus, we obtain the vector representation $\mathbf{x}_h \in \mathbb{R}^m$ of
the whole subtree rooted at node $h$.

Figure 3 shows the architecture of our proposed RCNN unit.

Figure 4: Example of an RCNN unit
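The forward pass of a single RCNN unit (concatenation, POS-dependent convolution, tanh and max pooling) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function `rcnn_unit`, the dictionary keyed by POS-tag pairs, the tiny dimensions ($m = 2$, $m_d = 1$) and every numeric value are assumptions chosen so that the result is easy to check by hand.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rcnn_unit(w_h, head_pos, children, W_by_pos, dist_emb):
    """Return (x_h, Z) for a head with word embedding w_h.

    children is a list of (x_c, child_pos, distance) triples, where x_c is
    the phrase representation of the child subtree.
    """
    if not children:                      # leaf node: x_h = w_h
        return list(w_h), []
    Z = []
    for x_c, child_pos, dist in children:
        p = w_h + x_c + dist_emb[dist]            # concatenation, Eq. (4)
        W = W_by_pos[(head_pos, child_pos)]       # matrix chosen by POS pair
        Z.append([math.tanh(u) for u in matvec(W, p)])      # Eq. (3)
    x_h = [max(z[j] for z in Z) for j in range(len(w_h))]   # Eq. (5)
    return x_h, Z


# Toy setting for "a red bike": m = 2, m_d = 1, so n = 5 (values made up).
w = {"a": [0.1, 0.2], "red": [0.5, -0.3], "bike": [-0.2, 0.4]}
dist_emb = {-2: [0.3], -1: [-0.1]}
W_by_pos = {
    ("NN", "DT"): [[1, 0, 0, 0, 0], [0, 0, 0, 1, 0]],
    ("NN", "JJ"): [[0, 0, 1, 0, 0], [0, 1, 0, 0, 0]],
}
children = [(w["a"], "DT", -2), (w["red"], "JJ", -1)]
x_bike, Z = rcnn_unit(w["bike"], "NN", children, W_by_pos, dist_emb)
print(x_bike)
```

The number of columns in `Z` follows the number of children, while `x_bike` always has length $m$, which is the property that lets a parent unit consume it.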


Given a whole dependency tree, we can apply the RCNN unit recursively to get
the vector representation of the whole sentence. The output of each RCNN unit
is used as the input of the RCNN unit of its parent node.

Thus, RCNN can be used to model the distributed representations of a phrase
or sentence with its dependency tree and can be applied to many NLP tasks.
The parameters of RCNN can be learned jointly with the specific NLP task.
Each RCNN unit can model the complicated interactions between the head word
and its children. Combined with a specific task, RCNN can select the useful
semantic and structural information through the convolution and max pooling
layers.

Figure 4 shows an example of RCNN modeling the sentence “I eat sashimi with
chopsticks”.
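The recursive application of the unit over a whole tree can be sketched as a post-order traversal. The sketch below is hypothetical in several respects: the `Params` container, its lazy random initialization, the toy sizes, the POS tags and the attachment of “with” to “eat” are all assumptions made for the example, not details taken from Figure 4.

```python
import math
import random

M, M_D = 4, 2                  # toy sizes for the word and distance vectors
rng = random.Random(0)


def rand_vec(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


class Params:
    """Toy parameter store; entries are created on first use."""

    def __init__(self):
        self.word, self.dist, self.W = {}, {}, {}

    def w(self, word):
        if word not in self.word:
            self.word[word] = rand_vec(M)
        return self.word[word]

    def d(self, dist):
        if dist not in self.dist:
            self.dist[dist] = rand_vec(M_D)
        return self.dist[dist]

    def comp(self, head_pos, child_pos):
        key = (head_pos, child_pos)
        if key not in self.W:
            self.W[key] = [rand_vec(2 * M + M_D) for _ in range(M)]
        return self.W[key]


def represent(node, params):
    """node = (index, word, pos, children); returns x_h for its subtree."""
    idx, word, pos, children = node
    w_h = params.w(word)
    if not children:
        return w_h                                  # leaf: x_h = w_h
    Z = []
    for child in children:
        x_c = represent(child, params)              # output of the child unit
        p = w_h + x_c + params.d(child[0] - idx)    # head, child, distance
        W = params.comp(pos, child[2])
        Z.append([math.tanh(sum(a * b for a, b in zip(row, p))) for row in W])
    return [max(z[j] for z in Z) for j in range(M)]  # max pooling


tree = (1, "eat", "VBP", [
    (0, "I", "PRP", []),
    (2, "sashimi", "NN", []),
    (3, "with", "IN", [(4, "chopsticks", "NNS", [])]),
])
params = Params()
sentence_vec = represent(tree, params)
print(sentence_vec)
```

Whatever the shape of the tree, the root returns a vector of length `M`, so sentences of different lengths map to vectors of the same size.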


4 Parsing 


In order to measure the plausibility of a subtree rooted at $h$ in a
dependency tree, we use a single-unit linear layer neural network to compute
the score of its RCNN unit.

For constituent parsing, the representation of a non-terminal node depends
only on its two children. The combination is relatively simple and its
correctness can be measured with the final representation of the
non-terminal node (Socher et al., 2013a).

However, for dependency parsing, all combinations of the head $h$ and its
children $c_i$ ($0 < i \le K$) are important for measuring the correctness of
the subtree. Therefore, our score function $s(h)$ is computed on all of the
hidden layers $\mathbf{z}_i$ ($0 < i \le K$):

$$s(h) = \sum_{i=1}^{K} \mathbf{v}^{(h,c_i)} \cdot \mathbf{z}_i, \tag{6}$$

where $\mathbf{v}^{(h,c_i)} \in \mathbb{R}^m$ is the score vector, which also
depends on the POS tags of $h$ and $c_i$.

Given a sentence $x$ and its dependency tree $y$, the goodness of a complete
tree is measured by summing the scores of all the RCNN units,

$$s(x, y, \Theta) = \sum_{h \in y} s(h), \tag{7}$$

where $h \in y$ is a node in tree $y$;
$\Theta = \{\Theta_W, \Theta_v, \Theta_w, \Theta_d\}$ includes the
combination matrix set $\Theta_W$, the score vector set $\Theta_v$, the word
embeddings $\Theta_w$ and the distance embeddings $\Theta_d$.

Finally, we can predict the dependency tree $\hat{y}$ with the highest score
for sentence $x$,

$$\hat{y} = \arg\max_{y \in gen(x)} s(x, y, \Theta), \tag{8}$$

where $gen(x)$ is defined as the set of all possible trees for sentence $x$.
When applied in re-ranking, $gen(x)$ is the set of the k-best outputs of a
base parser.


5 Training


For a given training instance $(x_i, y_i)$, we use the max-margin criterion
to train our model. We first predict the dependency tree $\hat{y}_i$ with the
highest score for each $x_i$ and define a structured margin loss
$\Delta(y_i, \hat{y}_i)$ between the predicted tree $\hat{y}_i$ and the given
correct tree $y_i$. $\Delta(y_i, \hat{y}_i)$ is measured by counting the
number of nodes with an incorrect span (or label) in the proposed tree
(Goodman, 1998),

$$\Delta(y_i, \hat{y}_i) = \sum_{d \in \hat{y}_i} \kappa \mathbf{1}\{d \notin y_i\}, \tag{9}$$

where $\kappa$ is a discount parameter and $d$ represents the nodes in trees.

Given a set of training dependency parses $\mathcal{D}$, the final training
objective is to minimize the loss function $J(\Theta)$, which includes an
$\ell_2$-regularization term:

$$J(\Theta) = \frac{1}{|\mathcal{D}|} \sum_{(x_i, y_i) \in \mathcal{D}} r_i(\Theta) + \frac{\lambda}{2} \|\Theta\|_2^2, \tag{10}$$

where

$$r_i(\Theta) = \max_{\hat{y}_i \in Y(x_i)} \big(0,\; s_t(x_i, \hat{y}_i, \Theta) + \Delta(y_i, \hat{y}_i) - s_t(x_i, y_i, \Theta)\big), \tag{11}$$

and $s_t$ denotes the tree score given by RCNN. By minimizing this objective,
the score of the correct tree $y_i$ is increased and the score of the
highest-scoring incorrect tree $\hat{y}_i$ is decreased.

We use a generalization of gradient descent called the subgradient method
(Ratliff et al., 2007), which computes a gradient-like direction. The
subgradient of Eq. (10) is:

$$\frac{\partial J}{\partial \Theta} = \frac{1}{|\mathcal{D}|} \sum_{(x_i, y_i) \in \mathcal{D}} \left( \frac{\partial s_t(x_i, \hat{y}_{\max}, \Theta)}{\partial \Theta} - \frac{\partial s_t(x_i, y_i, \Theta)}{\partial \Theta} \right) + \lambda \Theta, \tag{12}$$

where $\hat{y}_{\max}$ is the tree attaining the maximum in Eq. (11), and the
bracketed difference is taken as zero when $r_i(\Theta) = 0$.

To minimize the objective, we use the diagonal variant of AdaGrad (Duchi et
al., 2011). The parameter update for the $i$-th parameter $\Theta_{t,i}$ at
time step $t$ is as follows:

$$\Theta_{t,i} = \Theta_{t-1,i} - \frac{\rho}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\, g_{t,i}, \tag{13}$$

where $\rho$ is the initial learning rate and
$g_\tau \in \mathbb{R}^{|\theta_i|}$ is the subgradient at time step $\tau$
for parameter $\theta_i$.
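The training criterion can be made concrete with a small Python sketch of the structured margin loss, the hinge term and one diagonal AdaGrad step. It is a sketch under assumptions rather than the authors' implementation: trees are represented as sets of (head, modifier) arcs, candidate scores are given as plain numbers, the small constant `eps` is added only for numerical safety, and the values of `kappa` and `rho` below are illustrative.

```python
import math


def margin_loss(gold_arcs, pred_arcs, kappa):
    """kappa times the number of predicted arcs missing from the gold tree."""
    return kappa * sum(1 for arc in pred_arcs if arc not in gold_arcs)


def hinge_term(gold_score, gold_arcs, candidates, kappa):
    """Hinge term r_i over candidates given as (score, arcs) pairs.

    Returns (r_i, index of the most violating candidate or None).
    """
    best_val, best_idx = 0.0, None
    for i, (s, arcs) in enumerate(candidates):
        val = s + margin_loss(gold_arcs, arcs, kappa) - gold_score
        if val > best_val:
            best_val, best_idx = val, i
    return best_val, best_idx


def adagrad_step(theta, grad, hist, rho, eps=1e-8):
    """In-place diagonal AdaGrad update; hist holds squared subgradients."""
    for i, g in enumerate(grad):
        hist[i] += g * g
        theta[i] -= rho * g / (math.sqrt(hist[i]) + eps)


gold = {(0, 2), (2, 1), (2, 3)}
wrong = {(0, 2), (2, 1), (1, 3)}           # one arc differs from the gold tree
candidates = [(1.0, gold), (1.2, wrong)]
r, worst = hinge_term(1.0, gold, candidates, kappa=0.5)

theta, hist = [1.0, -1.0], [0.0, 0.0]
adagrad_step(theta, [0.5, 0.0], hist, rho=0.1)
print(r, worst, theta)
```

In this toy case the incorrect candidate both scores higher than the gold tree and has one wrong arc, so it is the one whose score the update would push down.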


6 Re-rankers 


Re-ranking k-best lists was introduced by Collins and Koo (2005) and Charniak
and Johnson (2005). They used discriminative methods to re-rank constituent
parses. In dependency parsing, Sangati et al. (2009) used a third-order
generative model for re-ranking the k-best lists of a base parser. Hayashi et
al. (2013) used a discriminative forest re-ranking algorithm for dependency
parsing. These re-ranking models achieved a substantial improvement in
parsing performance.

Given $T(x)$, the set of k-best trees of a sentence $x$ from a base parser,
we use the popular mixture re-ranking strategy (Hayashi et al., 2013; Le and
Mikolov, 2014), which is a combination of our model and the base parser,

$$\hat{y}_i = \arg\max_{y \in T(x_i)} \alpha\, s_t(x_i, y, \Theta) + (1 - \alpha)\, s_b(x_i, y), \tag{14}$$

where $\alpha \in [0, 1]$ is a hyperparameter; $s_t(x_i, y, \Theta)$ and
$s_b(x_i, y)$ are the scores given by RCNN and the base parser respectively.
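The scoring of a candidate tree and the mixture re-ranking rule can be sketched together in a few lines of Python. This is a hedged sketch: the data layout (each RCNN unit as a list of (score vector, hidden vector) pairs, each candidate as an (RCNN score, base score) pair) and all numbers are assumptions made for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit_score(pairs):
    """Score of one RCNN unit: sum of v . z over the head's children."""
    return sum(dot(v, z) for v, z in pairs)


def tree_score(units):
    """Score of a tree: sum of the scores of all its RCNN units."""
    return sum(unit_score(pairs) for pairs in units)


def rerank(candidates, alpha):
    """Index of the candidate with the best mixture of (s_t, s_b) scores."""
    return max(range(len(candidates)),
               key=lambda i: alpha * candidates[i][0]
               + (1 - alpha) * candidates[i][1])


# A tree with two units: the first head has two children, the second one.
units = [
    [([1.0, 0.0], [0.5, 0.2]), ([0.0, 1.0], [0.1, -0.3])],
    [([1.0, 1.0], [0.2, 0.2])],
]
s_tree = tree_score(units)

# Three candidates of one sentence as (RCNN score, base parser score).
candidates = [(0.2, 0.9), (0.8, 0.7), (0.5, 0.1)]
picks = [rerank(candidates, a) for a in (0.0, 0.5, 1.0)]
print(s_tree, picks)
```

With `alpha = 0` the base parser's ranking is reproduced and with `alpha = 1` only the RCNN score counts; intermediate values trade the two off.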

To apply RCNN to the re-ranking model, we first obtain the k-best outputs of
all sentences in the training set with a base parser. Thus, we can train the
RCNN in a discriminative way and optimize the re-ranking strategy for a
particular base parser.

Note that the potential of RCNN is not fully exploited when it is applied in
a re-ranking model, since $gen(x)$ in Eq. (8) is just the k-best outputs of a
base parser, not the set of all possible trees for sentence $x$. The
parameters of RCNN could overfit to the k-best outputs of the training set.

7 Experiments 
7.1 Datasets 

To empirically demonstrate the effectiveness of our approach, we use two
datasets in different languages (English and Chinese) in our experimental
evaluation and compare our model against other state-of-the-art methods
using the unlabeled attachment score (UAS) metric, ignoring punctuation.
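For reference, the evaluation metric can be sketched as below. The set of POS tags treated as punctuation is an assumption (a common convention for the Penn Treebank) and is not stated in the text; the function name `uas` and the toy sentence are likewise hypothetical.

```python
PUNCT_TAGS = frozenset({"``", "''", ",", ".", ":"})  # assumed punctuation tags


def uas(gold_heads, pred_heads, gold_pos, punct_tags=PUNCT_TAGS):
    """Percentage of non-punctuation tokens whose predicted head is correct."""
    scored = [(g, p) for g, p, t in zip(gold_heads, pred_heads, gold_pos)
              if t not in punct_tags]
    if not scored:
        return 0.0
    return 100.0 * sum(g == p for g, p in scored) / len(scored)


# "I eat sashimi ." with 1-based head indices (0 marks the root).
gold = [2, 0, 2, 2]
pred = [2, 0, 1, 1]      # wrong heads for "sashimi" and for the final period
pos = ["PRP", "VBP", "NN", "."]
result = uas(gold, pred, pos)
print(round(result, 2))
```

The wrongly attached period does not affect the score, because punctuation tokens are excluded before counting.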





English  For the English dataset, we follow the standard splits of the Penn
Treebank (PTB), using sections 2-21 for training, section 22 as the
development set and section 23 as the test set. We tag the development and
test sets using an automatic POS tagger (at 97.2% accuracy), and tag the
training set using four-way jackknifing similar to Collins and Koo (2005).

Chinese  For the Chinese dataset, we follow the same split of the Penn
Chinese Treebank (CTB5) as described in Zhang and Clark (2008) and use
sections 001-815, 1001-1136 as the training set, sections 886-931, 1148-1151
as the development set, and sections 816-885, 1137-1147 as the test set.
Dependencies are converted using the Penn2Malt tool with the head-finding
rules of Zhang and Clark (2008). Following Zhang and Clark (2008) and Zhang
and Nivre (2011), we use gold segmentation and POS tags for the input.

We use the linear-time incremental parser (Huang and Sagae, 2010) as our base
parser and calculate the 64-best parses at the top cell of the chart. Note
that we optimize the training settings for the base parser, so its results
are slightly better than those reported by Huang and Sagae (2010). Then we
use the max-margin criterion to train RCNN. Finally, we use the mixture
strategy to re-rank the 64-best parses.

For the initialization of parameters, we train word2vec embeddings (Mikolov
et al., 2013) on the Wikipedia corpus for English and Chinese respectively.
For the combination matrices and score vectors, we use random initialization
within (0.01, 0.01). The parameters that achieve the best unlabeled
attachment score on the development set are chosen for the final evaluation.


Figure 6: Accuracies on the top ten POS tags of 
the modifier words with the largest improvements 
on the development set. 


7.2 English Dataset 

We first evaluate the performance of the RCNN and the re-ranker (Eq. (14)) on
the development set. Figure 5 shows the UAS of different models with varying
k. The base parser achieves 92.45%. When k = 64, the oracle best of the base
parser achieves 97.34%, while the oracle worst achieves 73.30% (-19.15%).
RCNN achieves its maximum of 93.00% (+0.55%) when k = 6. When k > 6, the
performance of RCNN declines as k increases but is still higher than the
baseline (92.45%). The reason is that RCNN could require more negative
samples to avoid overfitting when k is large. Since the negative samples are
limited to the k-best outputs of a base parser, the learnt parameters could
easily overfit to the training set.
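The oracle best and oracle worst curves can be computed from the k-best lists alone. The sketch below assumes that each sentence comes with the number of correctly attached tokens for every candidate, listed in the base parser's order; the function name and the toy counts are hypothetical.

```python
def oracle_uas(kbest_correct, n_tokens, k, pick=max):
    """UAS when always picking the best (or worst) of the first k candidates.

    kbest_correct: per sentence, correct-head counts of its candidates.
    n_tokens: per sentence, number of scored tokens.
    """
    correct = sum(pick(counts[:k]) for counts in kbest_correct)
    return 100.0 * correct / sum(n_tokens)


# Two sentences with three candidates each (toy counts).
kbest_correct = [[4, 5, 2], [3, 3, 1]]
n_tokens = [5, 4]

base = oracle_uas(kbest_correct, n_tokens, k=1)             # 1-best output
best3 = oracle_uas(kbest_correct, n_tokens, k=3, pick=max)
worst3 = oracle_uas(kbest_correct, n_tokens, k=3, pick=min)
print(base, best3, worst3)
```

As k grows, the oracle best can only rise and the oracle worst can only fall, which is the widening gap between the two curves described above.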

The mixture re-ranker achieves its maximum of 93.50% (+1.05%) when k = 64. In
the mixture re-ranker, $\alpha$ is optimised by searching with a step size of
0.005.
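The search over $\alpha$ can be sketched as a simple grid search on the development set. The data layout (each candidate as a tuple of RCNN score, base score, number of correct heads and number of tokens) and the tie-breaking rule (keep the smallest best $\alpha$) are assumptions made for this example.

```python
def search_alpha(dev_set, step=0.005):
    """Grid search for the mixture weight that maximises development UAS.

    dev_set: list of sentences, each a list of candidate tuples
    (s_t, s_b, n_correct_heads, n_tokens).
    """
    n_steps = round(1 / step)
    best_alpha, best_uas = 0.0, -1.0
    for i in range(n_steps + 1):
        alpha = i * step
        correct = total = 0
        for cands in dev_set:
            chosen = max(cands,
                         key=lambda t: alpha * t[0] + (1 - alpha) * t[1])
            correct += chosen[2]
            total += chosen[3]
        score = 100.0 * correct / total
        if score > best_uas:            # keeps the smallest best alpha
            best_alpha, best_uas = alpha, score
    return best_alpha, best_uas


# Toy development set: RCNN is right on the first sentence, the base parser
# on the second, so only a mixture gets both right.
dev_set = [
    [(0.9, 0.2, 5, 5), (0.1, 0.8, 3, 5)],
    [(0.6, 0.1, 2, 4), (0.4, 0.9, 4, 4)],
]
alpha, dev_uas = search_alpha(dev_set)
print(alpha, dev_uas)
```

In this toy case neither model alone selects the better candidate for both sentences, while a mixture weight in the middle of the range does.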

Therefore, we use the mixture re-ranker in the following experiments, since
it can take advantage of both the RCNN and the base model.

Figure 6 shows the accuracies on the top ten POS tags of the modifier words
with the largest improvements. We can see that our re-ranker improves the
accuracies of CC and IN, and may therefore indirectly help with the
well-known coordinating conjunction and PP-attachment problems.

The final experimental results on the test set are shown in Table 1. The
hyperparameters of our model are set as in Table 2. Our re-ranker achieves
93.83% (+1.48%) on the test set. Our system performs slightly better than
many state-of-the-art systems such as Zhang and Clark (2008) and Huang and
Sagae (2010). It outperforms Hayashi et al. (2013) and Le and Zuidema (2014),
which also use the mixture re-ranking strategy.

Figure 5: UAS with varying k on the development set. Oracle best: always
choosing the best result in the k-best of the base parser; Oracle worst:
always choosing the worst result in the k-best of the base parser; RCNN:
choosing the most probable candidate according to the score of RCNN;
Re-ranker: a combination of the RCNN and the base parser.

Since the result of the re-ranker is conditioned on the k-best results of the
base parser, we also run an experiment that avoids this limitation by adding
the oracle to the k-best candidates. With the oracle included, the re-ranker
achieves 94.16% UAS, which is shown in the last line (“our re-ranker (with
oracle)”) of Table 1.

7.3 Chinese Dataset 

We also conduct experiments on the Penn Chinese Treebank (CTB5). The
hyperparameters are the same as in the previous experiment on English, except
that $\alpha$ is optimised by searching with a step size of 0.005.

The final experimental results on the test set are shown in Table 3. Our
re-ranker achieves 85.71% (+0.25%) on the test set, which outperforms the
previous methods in Table 3 except the re-ranker of Hayashi et al. (2013).
With the oracle added, the re-ranker achieves 87.43% UAS, which is shown in
the last line (“our re-ranker (with oracle)”) of Table 3. Compared with the
re-ranking model of Hayashi et al. (2013), which uses a large number of
handcrafted features, our model achieves competitive performance with minimal
feature engineering.

                                    UAS
  Traditional Methods
    Zhang and Clark (2008)          91.4
    Huang and Sagae (2010)          92.1
  Distributed Representations
    Stenetorp (2013)                86.25
    Chen et al. (2014)              93.74
    Chen and Manning (2014)         92.0
  Re-rankers
    Hayashi et al. (2013)           93.12
    Le and Zuidema (2014)           93.12
    Our baseline                    92.35
    Our re-ranker                   93.83 (+1.48)
    Our re-ranker (with oracle)     94.16

Table 1: Accuracy on the English test set. Our baseline is the result of the
base parser; our re-ranker uses the mixture strategy on the 64-best outputs
of the base parser; our re-ranker (with oracle) adds the oracle to the k-best
outputs of the base parser.

7.4 Discussions 

The performance of the re-ranking model is affected by the base parser. The
small divergence among the dependency trees in the output list also results
in overfitting in the training phase. Although our re-ranker outperforms the
state-of-the-art methods, it can also benefit from improving the quality of
the candidate results. It was also reported in other re-ranking works that a
larger k (e.g. k > 64) results in worse performance. We think the reason is
that the oracle best increases when k is larger, but the oracle worst
decreases to a larger degree. The error types increase greatly. The
re-ranking model requires more negative samples to avoid overfitting. When k
is larger, the number of negative samples needed for training also grows
multiplicatively. However, we can obtain at most k negative samples from the
k-best outputs of the base parser.

  Word embedding size         $m = 25$
  Distance embedding size     $m_d = 25$
  Initial learning rate       $\rho = 0.1$
  Margin loss discount        $\kappa = 2.0$
  Regularization              $\lambda = 10^{-4}$
  k-best                      $k = 64$

Table 2: Hyperparameters of our model

                                    UAS
  Traditional Methods
    Zhang and Clark (2008)          84.33
    Huang and Sagae (2010)          85.20
  Distributed Representations
    Chen et al. (2014)              82.94
    Chen and Manning (2014)         83.9
  Re-rankers
    Hayashi et al. (2013)           85.9
    Our baseline                    85.46
    Our re-ranker                   85.71 (+0.25)
    Our re-ranker (with oracle)     87.43

Table 3: Accuracy on the Chinese test set.

The experiments also show that our model can achieve significant improvements
when the oracles are added to the output lists of the base parser. This
indicates that our model can be boosted by a better set of candidate results,
which can be obtained by combining the RCNN with the decoding algorithm.


8 Related Work 


Several works have used neural networks and distributed representations for
dependency parsing.


Stenetorp (2013) attempted to build recursive neural networks for
transition-based dependency parsing; however, the empirical performance of
his model is still unsatisfactory. Chen and Manning (2014) improved
transition-based dependency parsing by representing all words, POS tags and
arc labels as dense vectors, and modeled their interactions with a neural
network to predict actions. Their methods are aimed at transition-based
parsing and cannot model the sentence in a semantic vector space for other
NLP tasks.

Figure 7: Example of a DT-RNN unit


Socher et al. (2013b) proposed compositional vectors computed by a dependency
tree RNN (DT-RNN) to map sentences and images into a common embedding space.
However, there are two major differences from our model. 1) They first sum up
all child nodes into a dense vector $\mathbf{v}_c$ and then compose the
subtree representation from $\mathbf{v}_c$ and the vector of the parent node.
In contrast, our model first combines the parent with each child and then
chooses the most informative features with a pooling layer. 2) We represent
the relative position of each child and its parent with a distributed
representation (position embeddings), which is very useful for the
convolutional layer. Figure 7 shows an example of a DT-RNN unit, for
comparison with how RCNN represents phrases as continuous vectors.
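The first difference can be illustrated with a deliberately simplified Python sketch that contrasts the two orders of operation. It is not an implementation of DT-RNN or of the full RCNN unit: both functions share one made-up matrix, and distance embeddings and POS-dependent parameters are left out so that only "sum, then compose" versus "compose, then pool" differs.

```python
import math


def tanh_layer(W, x):
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def sum_then_compose(W, parent, children):
    """Sum the child vectors into v_c first, then compose with the parent."""
    v_c = [sum(col) for col in zip(*children)]
    return tanh_layer(W, parent + v_c)


def compose_then_pool(W, parent, children):
    """Compose the parent with each child, then max-pool over the children."""
    Z = [tanh_layer(W, parent + c) for c in children]
    return [max(col) for col in zip(*Z)]


W = [[0, 0, 1, 0],          # toy matrix that reads only the child part
     [0, 0, 0, 1]]
parent = [0.0, 0.0]
children = [[0.5, -0.5], [-0.5, 0.5]]   # two children with opposite vectors

summed = sum_then_compose(W, parent, children)
pooled = compose_then_pool(W, parent, children)
print(summed, pooled)
```

In this toy case the two opposite child vectors cancel when summed, whereas pooling keeps the strongest response of each feature across the children.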


Specific to the re-ranking model, Le and Zuidema (2014) proposed a generative
re-ranking model with an Inside-Outside Recursive Neural Network (IORNN),
which can process trees both bottom-up and top-down. However, IORNN works in
a generative way and just estimates the probability of a given tree, so IORNN
cannot fully utilize the incorrect trees in the k-best candidate results.
Besides, IORNN treats a dependency tree as a sequence, which can be regarded
as a generalization of the simple recurrent neural network (SRNN) (Elman,
1990). Unlike IORNN, our proposed RCNN is a discriminative model and can
optimize the re-ranking strategy for a particular base parser. Another
difference is that RCNN computes the score of a tree in a recursive way,
which is more natural for the hierarchical structure of natural language.
Besides, RCNN can not only be used for re-ranking, but can also be regarded
as a general model to represent a sentence with its dependency tree.

9 Conclusion 

In this work, we address the problem of representing nodes at all levels
(words or phrases) with dense representations in a dependency tree. We
propose a recursive convolutional neural network (RCNN) architecture to
capture the syntactic and compositional-semantic representations of phrases
and words. RCNN is a general architecture and can deal with k-ary parsing
trees; therefore RCNN is very suitable for many NLP tasks, minimizing the
effort in feature engineering with the help of an external dependency parser.
Although RCNN is used only for re-ranking the output of a dependency parser
in this paper, it can be regarded as a semantic model of text sequences that
maps input sequences of varying length to a fixed-length vector. The
parameters of RCNN can be learned jointly with other NLP tasks, such as text
classification.

For future research, we will develop an integrated parser that combines RCNN
with a decoding algorithm. We believe that the integrated parser can achieve
better performance without the limitation of the base parser. Moreover, we
also wish to investigate the ability of our model on other NLP tasks.